 tts technology


Optimizing Multilingual Text-To-Speech with Accents & Emotions

Pawar, Pranav, Dwivedi, Akshansh, Boricha, Jenish, Gohil, Himanshu, Dubey, Aditya

arXiv.org Artificial Intelligence

State-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual settings, but synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions remains difficult because current frameworks handle cultural nuance poorly. This paper introduces a new TTS architecture that integrates accent modelling with transliteration preservation and multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model with a language-specific phoneme alignment hybrid encoder-decoder architecture, culture-sensitive emotion embedding layers trained on native speaker corpora, and dynamic accent code switching with residual vector quantization. Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word Error Rate reduction from 15.4% to 11.8%) and 85.3% emotion recognition accuracy from native listeners, surpassing the METTS and VECL-TTS baselines. The novelty of the system is that it can code-mix in real time, generating statements such as "Namaste, let's talk about " with uninterrupted accent shifts while preserving emotional consistency. Subjective evaluation with 200 users reported a mean opinion score (MOS) of 4.2/5 for cultural correctness, significantly better than existing multilingual systems (p<0.01). This research makes cross-lingual synthesis more feasible by showcasing scalable accent-emotion disentanglement, with direct application in South Asian EdTech and accessibility software.
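
The abstract gives both a percentage improvement and the two underlying WER values. The following is a minimal sketch of how such a figure can be checked, assuming the improvement is meant as a relative WER reduction (the abstract does not state the basis). The function name `relative_reduction` is hypothetical and not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# WER values as quoted in the abstract (in percent).
wer_before = 15.4
wer_after = 11.8

# Absolute drop in percentage points versus relative drop in percent.
absolute_drop = wer_before - wer_after
relative_drop = relative_reduction(wer_before, wer_after)

print(f"absolute: {absolute_drop:.1f} points, relative: {relative_drop:.1f}%")
```

Under this assumption the two quoted WER values give a relative reduction of roughly 23.4%, so the 23.7% in the abstract may rest on a different basis or on unrounded WER values.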


A review-based study on different Text-to-Speech technologies

Chowdhury, Md. Jalal Uddin, Hussan, Ashab

arXiv.org Artificial Intelligence

This research paper presents a comprehensive review-based study on various Text-to-Speech (TTS) technologies. TTS technology is an important aspect of human-computer interaction, enabling machines to convert written text into audible speech. The paper examines the different TTS technologies available, including concatenative TTS, formant synthesis TTS, and statistical parametric TTS. The study focuses on comparing the advantages and limitations of these technologies in terms of their naturalness of voice, the level of complexity of the system, and their suitability for different applications. In addition, the paper explores the latest advancements in TTS technology, including neural TTS and hybrid TTS. The findings of this research will provide valuable insights for researchers, developers, and users who want to understand the different TTS technologies and their suitability for specific applications.


Less-Known Facts About AI Voices and Text-to-Speech

#artificialintelligence

Voice artificial intelligence is an emerging technology that lets people interact with machines through voice commands. The field is growing rapidly, and active engineering research is exploring areas that remain untapped. We are well accustomed to hearing AI voices narrate articles and reports in a monotone. Among the most popular examples are Alexa- and Siri-enabled devices. These devices are gaining significant recognition, and the market for similar products is growing quickly.


How innovations in voice have made it an end-to-end commerce channel

#artificialintelligence

Text-to-speech (TTS) technology isn't exactly new – but the way it's shaping the future certainly is. From smart speakers to voice assistants, TTS is increasingly paramount in day-to-day interactions between brands and end users, leading to enhanced brand experiences and better business outcomes. Up until recently, TTS was confined to a specific use case: voice-enablement of written content to make computers 'speak' to those with visual or reading impairments. TTS technology was based on utility and a need to make screen-related content accessible. As such, synthetic speech was traditionally digital-sounding and marred by poor audio quality and speaking style.